In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Please click here to toggle on/off the code"></form>''')
Out[1]:

Abstract

Road accidents has been a critical problem as every year more than 1.2M people die across the globle. There is a pressing need to make use of the data and understand the underlying cause of problem. Road safety issues are complex. There are significant differences in policies within and across the countries. In this analysis, the data from Metropolitan Police Department's (MPD) crash data management system (COBALT) is studied to find relationship between fatality and independent features. The crash data is for DC state.

Each year, more than 1.2 million people die across the globe due to road crashes; there is a pressing need to understand the underlying cause of the problem. As road safety issues are complex; it involves multi-sectorial ranging from the public, stakeholders to the policy makers. Significant differences exist both across and within countries and therefore policies and interventions need to be adapted to the local environment. The effectiveness of interventions requires a multi-disciplinary approach which include enforcement, engineering and psychological and education approaches. While the resources are limited, road safety interventions must not only address the sustainability of the outcomes but also the cost-effectiveness to implement and maintain it. More important, interventions must be evidence-based and can be evaluated over time before it is translated into policy. Hence, the research cannot be done in silo for better addressing the complexity of road safety issues. For sustainability, road safety interventions need to be guided and governed by policy in the implementation and development.

Pipeline

Data Preprocessing

In [5]:
# data = pd.read_csv('drive/MyDrive/Crash_Details_Table.csv', delimiter='\t')
data = pd.read_csv('Crash_Details_Table.csv', delimiter='\t')
data.head()
C:\Users\admin\AppData\Roaming\Python\Python37\site-packages\IPython\core\interactiveshell.py:3146: DtypeWarning: Columns (2) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[5]:
id crime id ccn person id person type age fatal major injury minor injury vehicle id vehicle type ticket issued License state impaired speeding
0 438194351 26872544 16034312 84628234 Passenger 31.0 N N N 2275009 Passenger Car/automobile N VA Y N
1 438194352 26872544 16034312 84833902 Passenger 31.0 N N N 2275009 Passenger Car/automobile N VA Y N
2 438194353 26872544 16034312 84938064 Driver NaN N N N 2275007 Passenger Car/automobile N None N N
3 438194354 26872544 16034312 84790164 Driver 31.0 N N N 2275009 Passenger Car/automobile N VA N N
4 438194355 26872544 16034312 84953497 Passenger 47.0 N N Y 2275008 Passenger Car/automobile N VA Y N
In [6]:
data.describe()
Out[6]:
id crime id person id age
count 5.963810e+05 5.963810e+05 5.963810e+05 426744.000000
mean 4.384924e+08 2.672116e+07 8.506922e+07 38.668302
std 1.721813e+05 1.238390e+06 8.613766e+06 20.897059
min 4.370014e+08 2.341134e+07 1.045383e+07 -7990.000000
25% 4.383433e+08 2.532167e+07 8.474899e+07 27.000000
50% 4.384924e+08 2.680585e+07 8.497752e+07 37.000000
75% 4.386415e+08 2.769386e+07 8.712287e+07 51.000000
max 4.387906e+08 2.872803e+07 9.077153e+07 237.000000

Outlier treatment

In [7]:
data['age'] = np.where(data['age']<1, np.nan, data['age'])
fig = px.histogram(data, x="age", marginal="box")
fig.show()
In [8]:
data['age'] = np.where(data['age']>85, np.nan, data['age'])
fig = px.histogram(data, x="age", marginal="box")
fig.show()

Age variables has lot of outliers. From the box plot values below 0 and above 85 are considered as outliers and replaced by na

Missings values detection and treatment

Only age column shows 30% of missing data which is replaced by 0

In [132]:
def missing_value_checker(df):
    """
    The missing value checker

    Parameters
    ----------
    df : dataframe
    
    Returns
    ----------
    The variables with missing value and their proportion of missing value
    """
    
    variable_proportion = [[variable, df[variable].isna().sum() / df.shape[0]] 
                           for variable in df.columns 
                           if df[variable].isna().sum() > 0]

    print('%-30s' % 'Variable with missing values', 'Proportion of missing values')
    for variable, proportion in sorted(variable_proportion, key=lambda x : x[1]):
        print('%-30s' % variable, proportion)
        
    return variable_proportion
variable_proportion = missing_value_checker(data)
data['age'] = data['age'].fillna(0)
Variable with missing values   Proportion of missing values
age                            0.309054111381818

Column Creation

  • Mask columns boolean fields fatal, impaired, speeding and ticket issued to 1s and 0s
  • Fatality Rate - at an accident level, number of people faced fatal accidents/number of total people involved in the accidents
  • Severity level - Club columns for minor, major and fatal accidents with indicator labels
  • Map US states into 4 regions designated by US census
In [133]:
#mask fatal as 1 and 0
data['fatal'] = np.where(data['fatal'] == 'Y', 1, 0)
data['impaired'] = np.where(data['impaired'] == 'Y', 1, 0)
data['speeding'] = np.where(data['speeding'] == 'Y', 1, 0)
data['ticket issued'] = np.where(data['ticket issued'] == 'Y', 1, 0)
In [134]:
fatalityRate = data.groupby(['crime id'], as_index=False).agg({'fatal':['count','sum']})
fatalityRate.columns = ['crime id', 'fatal_count', 'fatal_sum']
fatalityRate['fatality rate'] = (fatalityRate['fatal_sum']/ fatalityRate['fatal_count'])*100
data = pd.merge(fatalityRate[['crime id', 'fatality rate']], data, how='inner', on='crime id')
In [136]:
# Club classes together
conditions = [
    (data['fatal'] == 1),
    (data['major injury'] == 'Y'),
    (data['minor injury'] == 'Y')]
choices = ['fatal','major', 'minor']
data['severity'] = np.select(conditions, choices, default=0)
In [171]:
usMapping = pd.read_csv('https://raw.githubusercontent.com/cphalpert/census-regions/master/us%20census%20bureau%20regions%20and%20divisions.csv')

data = pd.merge(data, usMapping[['State Code', 'Region']], left_on='License state', right_on='State Code', how='left')

EDA

In this analysis, fatal is the target variable. As indicated below the dataset is heavily imbalanced which is treated futhur. Only 0.06% (~417) are fatal

In [139]:
print("Percentage of non fatal class {} and fatal class {} ".format(
    round((data.groupby(['fatal'], as_index=False).agg({'fatal':'count'})['fatal'][0]/ data.shape[0])*100, 2), 
    round((data.groupby(['fatal'], as_index=False).agg({'fatal':'count'})['fatal'][1]/ data.shape[0])*100, 2)))

print("Percentage of non major class {} and major class {} ".format(
    round((data.groupby(['major injury'], as_index=False).agg({'major injury':'count'})['major injury'][0]/ data.shape[0])*100, 2), 
    round((data.groupby(['major injury'], as_index=False).agg({'major injury':'count'})['major injury'][1]/ data.shape[0])*100, 2)))

print("Percentage of non minor class {} and minor class {} ".format(
    round((data.groupby(['minor injury'], as_index=False).agg({'minor injury':'count'})['minor injury'][0]/ data.shape[0])*100, 2), 
    round((data.groupby(['minor injury'], as_index=False).agg({'minor injury':'count'})['minor injury'][1]/ data.shape[0])*100, 2)))
Percentage of non fatal class 99.93 and fatal class 0.07 
Percentage of non major class 96.42 and major class 3.58 
Percentage of non minor class 88.93 and minor class 11.07 

How many people involved in the accidents suffered injury?

In [140]:
crime = pd.DataFrame(data[['crime id', 'severity']].value_counts())
crime = crime.reset_index()
crime.columns = ['crime id', 'severity', 'count of people']
fig = px.histogram(crime, x="count of people", color='severity', marginal="box")
fig.show()

Which vehicle type has maximum accidents and most fatal accidents?

In [147]:
vhCount = pd.DataFrame(data[['vehicle type', 'severity']].value_counts())
vhCount.reset_index(inplace=True) 
vhCount.columns = ['vehicle type', 'severity','count']

fig = px.bar(vhCount, x='vehicle type', y='count', color='severity',
             title="Number of accidents by vehicle type",
             template="simple_white")
# fig.layout.update(showlegend=False) 

fig.show()

Which state people travel most to dc or within dc?

In [161]:
vhCount = pd.DataFrame(data[['License state']].value_counts())
vhCount
vhCount.reset_index(inplace=True) 
vhCount.columns = ['License state', 'count']

fig = go.Figure(data=go.Choropleth(
    locations=vhCount['License state'], # Spatial coordinates
    z = vhCount['count'], # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = "Accidents",
))

fig.update_layout(
    title_text = 'Accidents by State',
    geo_scope='usa', # limite map scope to USA
)

fig.show()

Does the fatality rate of accidents changes for different state licenses?

As people are not familiar with the terrain, we observe that people holding licenses from different states have higher fatality rate

In [165]:
stateFatality = data[data['fatality rate']>0.0000]
stateFatalityAvg = stateFatality.groupby(['License state'], as_index=False).agg({'fatality rate':'mean'})
stateFatalityAvg

fig = go.Figure(data=go.Choropleth(
    locations=stateFatalityAvg['License state'], # Spatial coordinates
    z = stateFatalityAvg['fatality rate'], # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = "Accidents",
))

fig.update_layout(
    title_text = 'Average fatality rate by State',
    geo_scope='usa', # limite map scope to USA
)

fig.show()

Does age affect the fatality rate of an accident?

In [327]:
# fatality rate and age
fig = px.scatter(data, x="age", y="fatality rate", color='severity')
fig.show()

Which category of the persons type met with maximum accidents and suffered maximum injury?

In [142]:
ptCount = pd.DataFrame(data[['person type', 'severity']].value_counts())
ptCount.reset_index(inplace=True) 
ptCount.columns = ['person type', 'severity','count']

# fatal = ptCount[ptCount['fatal'] == 'Y']
# nonfatal = ptCount[ptCount['fatal'] == 'N']
# # x = ptCount['person type']
# # y = ptCount['count']

fig = px.bar(ptCount, x='person type', y='count', color='severity',
             title="Number of people met with an accident by person type",
             template="simple_white", 
             category_orders={ # replaces default order by column name
                 "person type": ["Driver", "Passenger", "Pedestrian", "Bicyclist"]
            },)

percentage = pd.crosstab(data['person type'], data['severity']).apply(lambda r: (r/r.sum())*100, axis=1)

fig.add_annotation(text="Accidents % <br> Fatal: {}, Major: {}, Minor: {}".format(
    (round(percentage['fatal']['Driver'], 2)),
    (round(percentage['major']['Driver'], 2)),
    (round(percentage['minor']['Driver'], 2))
    ), x="Driver", y=70000, showarrow=False)

fig.add_annotation(text="Accidents % <br> Fatal: {}, Major: {}, Minor: {}".format(
    (round(percentage['fatal']['Passenger'], 2)),
    (round(percentage['major']['Passenger'], 2)),
    (round(percentage['minor']['Passenger'], 2))
    ), x="Passenger", y=70000, showarrow=False)
fig.add_annotation(text="Accidents % <br> Fatal: {}, Major: {}, Minor: {}".format(
    (round(percentage['fatal']['Pedestrian'], 2)),
    (round(percentage['major']['Pedestrian'], 2)),
    (round(percentage['minor']['Pedestrian'], 2))
    ), x="Pedestrian", y=70000, showarrow=False)
fig.add_annotation(text="Accidents % <br> Fatal: {}, Major: {}, Minor: {}".format(
    (round(percentage['fatal']['Bicyclist'], 2)),
    (round(percentage['major']['Bicyclist'], 2)),
    (round(percentage['minor']['Bicyclist'], 2))
    ), x="Bicyclist", y=70000, showarrow=False)

# fig.layout.update(showlegend=False) 
fig.show()
  1. Data contains 4 types of persons such as driver, passenger, pedestrian and bicyclist
  2. Driver contributes 77% of data entries followed by 19 % passenger, 2 % pedestrians and 0.6% bicyclist
  3. Although, driver has maximum entries only 0.05% accidents are fatal whereas 2.63% and 8.04% are major and minor
  4. Passenger suffered maximum minor injuries in the accidents.
  5. Pedestrian do not frequently meet with an accident but when occurred they have suffered 53% minor injuries, 24% major and 0.7% fatal injuries. They have maximum fatalities compared to other types
  6. Similarly, bicyclist have least entries of accidents but when met with an accident they suffered 60.82% of minor injuries 8% major injuries and 0.22% fatal injuries

Analysis according to person type

Driver

In [172]:
driver = data[data['person type'] == 'Driver']
In [146]:
driver = driver[driver['age'] >1]
fig = px.histogram(driver, x="age", color='severity', marginal="box")
fig.show()
Does vehicle type affect the severity of accidents?
In [148]:
vhCount = pd.DataFrame(driver[['vehicle type', 'severity']].value_counts())
vhCount.reset_index(inplace=True) 
vhCount.columns = ['vehicle type', 'severity','count']

fig = px.bar(vhCount, x='vehicle type', y='count', color='severity',
             title="Number of accidents by vehicle type",
             template="simple_white")
# fig.layout.update(showlegend=False) 

fig.show()

Passenger

In [232]:
passenger = data[data['person type'] == 'Passenger']
Out[232]:
array(['Driver', 'Passenger', 'Pedestrian', 'Bicyclist'], dtype=object)
In [ ]:
passenger = passenger[passenger['age'] >1]
fig = px.histogram(passenger, x="age", color='severity', marginal="box")
fig.show()
In [236]:
vhCount = pd.DataFrame(passenger[['vehicle type', 'severity']].value_counts())
vhCount.reset_index(inplace=True) 
vhCount.columns = ['vehicle type', 'severity','count']

fig = px.bar(vhCount, x='vehicle type', y='count', color='severity',
             title="Number of accidents by vehicle type",
             template="simple_white")
# fig.layout.update(showlegend=False) 

fig.show()

Pedestrian

In [314]:
pedestrian = data[data['person type'] == 'Pedestrian']
In [315]:
pedestrian = pedestrian[pedestrian['age'] >1]
fig = px.histogram(pedestrian, x="age", color='severity', marginal="box")
fig.show()

Model Selection

Random Uniform Under Sampling

In [317]:
w_class0 = len(np.where((data['fatal'] == 0) & (data['vehicle type'] == 'Passenger Car/automobile'))[0])
w_class1 = len(np.where(data['fatal'] == 1)[0])
w_class1_downsampled = np.random.choice(w_class0, size=w_class1, replace=False)
a = data[(data['fatal'] != 0)]
b = data[data.index.isin(w_class1_downsampled)]
a = a.append(b)
In [321]:
a['severity'].value_counts()
Out[321]:
fatal    417
0        348
minor     41
major     28
Name: severity, dtype: int64

Relationship Strengths

In [323]:
dython.nominal.associations(a[['severity', 'fatal', 'fatality rate','vehicle type', 'ticket issued','Region', 'impaired','age', 'speeding', 'person type']], theil_u=True)
Out[323]:
{'ax': <matplotlib.axes._subplots.AxesSubplot at 0x7f3ddb0d6090>,
 'corr':                severity     fatal  ...  speeding  person type
 severity       1.000000  1.000000  ...  0.212727     0.105321
 fatal          1.000000  1.000000  ...  0.212132     0.359681
 fatality rate  0.798077  0.798077  ...  0.270035     0.264249
 vehicle type   0.140067  0.424649  ...  0.362690     0.459223
 ticket issued  0.091293 -0.049995  ...  0.170816     0.235545
 Region         0.049207  0.279916  ...  0.065214     0.297183
 impaired       0.085126  0.085126  ... -0.021497     0.138003
 age            0.158132  0.036961  ...  0.031085     0.447117
 speeding       0.212727  0.212132  ...  1.000000     0.174252
 person type    0.096710  0.359681  ...  0.174252     1.000000
 
 [10 rows x 10 columns]}

Multiple Linear Regression

In [324]:
X = a[['vehicle type', 'age', 'Region', 'speeding', 'person type']]
y = a[['fatality rate']]
X = pd.get_dummies(data = X)
X['age'] = X['age'].fillna(0)

import statsmodels.api as sm
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model

# Note the difference in argument order
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()
Out[324]:
OLS Regression Results
Dep. Variable: fatality rate R-squared: 0.216
Model: OLS Adj. R-squared: 0.193
Method: Least Squares F-statistic: 9.304
Date: Sun, 04 Apr 2021 Prob (F-statistic): 1.14e-29
Time: 21:14:09 Log-Likelihood: -3871.9
No. Observations: 834 AIC: 7794.
Df Residuals: 809 BIC: 7912.
Df Model: 24
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 18.6436 4.396 4.241 0.000 10.015 27.272
age 0.2285 0.045 5.106 0.000 0.141 0.316
speeding 27.9347 4.030 6.931 0.000 20.023 35.846
vehicle type_Atv (all Terrain Vehicle) 33.8964 24.137 1.404 0.161 -13.483 81.276
vehicle type_Bus -20.9126 7.571 -2.762 0.006 -35.773 -6.052
vehicle type_Cargo Van -6.4207 12.539 -0.512 0.609 -31.034 18.192
vehicle type_Drugs/ Narcotics 4.8642 7.794 0.624 0.533 -10.434 20.162
vehicle type_Firearms -1.5009 6.264 -0.240 0.811 -13.796 10.794
vehicle type_Large/heavy Truck -4.4145 10.324 -0.428 0.669 -24.679 15.850
vehicle type_Moped/scooter 13.7245 10.407 1.319 0.188 -6.703 34.153
vehicle type_Motor Cycle 22.9217 5.146 4.454 0.000 12.820 33.023
vehicle type_Motorhome/camper/rv (recreational Vehicle) -10.3019 18.139 -0.568 0.570 -45.907 25.303
vehicle type_None 3.4435 11.428 0.301 0.763 -18.989 25.876
vehicle type_Other Small/light Truck -6.0372 14.187 -0.426 0.671 -33.885 21.811
vehicle type_Other Vehicle -4.3875 3.816 -1.150 0.251 -11.879 3.104
vehicle type_Passenger Car/automobile -1.1709 3.128 -0.374 0.708 -7.311 4.969
vehicle type_Passenger Van 3.5861 5.613 0.639 0.523 -7.431 14.603
vehicle type_Pickup Truck -4.2755 8.492 -0.503 0.615 -20.945 12.394
vehicle type_Suv (sport Utility Vehicle) -4.3712 4.604 -0.949 0.343 -13.408 4.665
Region_Midwest -11.0248 8.192 -1.346 0.179 -27.104 5.055
Region_Northeast -12.4382 7.356 -1.691 0.091 -26.876 2.000
Region_South -3.8695 3.591 -1.077 0.282 -10.919 3.180
Region_West -12.2348 12.953 -0.945 0.345 -37.660 13.191
person type_Bicyclist 10.1078 7.326 1.380 0.168 -4.272 24.488
person type_Driver -3.9253 4.797 -0.818 0.413 -13.341 5.490
person type_Passenger -0.5231 5.123 -0.102 0.919 -10.579 9.533
person type_Pedestrian 12.9841 8.348 1.555 0.120 -3.403 29.371
Omnibus: 226.564 Durbin-Watson: 1.033
Prob(Omnibus): 0.000 Jarque-Bera (JB): 509.393
Skew: 1.476 Prob(JB): 2.44e-111
Kurtosis: 5.438 Cond. No. 4.90e+16


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.72e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Random Forest

In [325]:
# define pipeline
# over = RandomOverSampler(sampling_strategy=0.1)
def getRF(df):
  X = df[[ 'age', 
          'vehicle type',
        'ticket issued', 'Region', 'impaired', 'speeding']]
  X = pd.get_dummies(data = X)
  X['age'] = X['age'].fillna(0)
  y = df[['fatal']]
  # under = RandomUnderSampler(sampling_strategy=0.5)
  # steps = [ ('u', under), ('m', RandomForestClassifier(random_state=1))]
  steps = [ ('m', RandomForestClassifier(random_state=1))]

  pipeline = Pipeline(steps=steps)
  # evaluate pipeline
  cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=1)
  scores = cross_val_score(pipeline, X, y, scoring='f1', cv=cv, n_jobs=-1)
  score = np.mean(scores)
  print('F1 Score: %.3f' % score)
  pipeline.fit(X, y)
  rf = pd.DataFrame(pipeline.named_steps['m'].feature_importances_, index=X.columns)
  rf['features'] = rf.index
  rf.columns = ['score', 'features']
  rf['score'] = rf['score'].round(2)
  rf = rf.sort_values('score', ascending=False)
  fig = px.bar(rf, x='features', y='score', color='features',
              title="Random forest feature importance",
              template="simple_white",  text='score')
  fig.layout.update(showlegend=False) 

  fig.show()
getRF(a)
F1 Score: 0.628

Discussion